Paper Review - Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments

November 01, 2020
Simulation

📖 Link to the Paper - Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments
Anderson, Peter, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. "Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674-3683. 2018.
💡 Link to Project Website - Room-to-Room (R2R) Navigation

Main Contribution

The research problem in this paper is embodied AI, specifically the task of Vision-and-Language Navigation (VLN). This is a practical problem in robotics, where language-empowered intelligent agents must ground instructions in the physical environment. Despite recent successes in vision and language tasks individually, their combination had not been systematically studied, largely because of the difficulty of linking the two modalities in unstructured, previously unseen environments.

This work pioneered research on visually-grounded natural language navigation and has inspired more recent work pushing the boundary forward. The main contributions of this paper are the Matterport3D Simulator, a large-scale interactive reinforcement learning environment; the Room-to-Room (R2R) benchmark dataset; and an attention-based sequence-to-sequence model that establishes a baseline for the VLN task.

Method

In this work, the authors first introduce the novel Matterport3D Simulator and the Room-to-Room task/dataset, and then investigate the difficulty of the task by proposing and evaluating several plausible models on it.

First, the Matterport3D Simulator is built on 10,800 densely-sampled panoramic RGB-D views of 90 real building-scale environments. The key point is the use of real-world imagery rather than readily available synthetic environments, since no synthetic dataset can match real images in visual richness. On top of this simulator, the R2R dataset is collected to support the R2R task, in which an embodied agent takes a natural-language instruction as input and navigates from a starting pose to a goal location.
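To make the simulator's interaction loop concrete, here is a minimal sketch of how an agent might step through an episode. It assumes the public MatterSim Python bindings; the method names are taken from that codebase, but signatures vary between releases, and the scan/viewpoint IDs below are placeholders rather than real dataset IDs.

```python
# Sketch of an agent-environment loop using the simulator's Python bindings.
# Method names follow the public MatterSim API (Simulator, newEpisode,
# getState, makeAction), but exact signatures differ across releases
# (older versions take scalars instead of lists), so treat this as
# illustrative pseudocode rather than a drop-in script.
import math
import MatterSim

sim = MatterSim.Simulator()
sim.setCameraResolution(640, 480)
sim.setCameraVFOV(math.radians(60))
sim.initialize()

# Placeholder scan / panorama-viewpoint IDs; real IDs come from the
# Matterport3D dataset and the R2R trajectory annotations.
sim.newEpisode(['<scan_id>'], ['<viewpoint_id>'], [0.0], [0.0])

for _ in range(10):
    state = sim.getState()[0]   # RGB image, pose, and reachable viewpoints
    # A real agent would pick an action from the instruction + observation;
    # this placeholder simply moves to the nearest navigable viewpoint if one
    # exists, otherwise turns right 30 degrees in place.
    if len(state.navigableLocations) > 1:
        sim.makeAction([1], [0.0], [0.0])   # (viewpoint index, heading change, elevation change)
    else:
        sim.makeAction([0], [math.radians(30)], [0.0])
```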

In the simulator, an embodied agent exploits the panoramic views to virtually “move” through the scene; unlike previous benchmarks, the agent can both move and control the camera. To build R2R, three navigation instructions per trajectory were collected via Amazon Mechanical Turk in a time-consuming annotation process. The average instruction is 29 words (much longer than typical VQA questions), and the average trajectory length is about 10m. Finally, a sequence-to-sequence model is proposed, similar in spirit to VQA models, combining ResNet-152 image features, LSTMs, and an attention mechanism over the encoded instruction. The LSTM encoder encodes the language tokens, and the LSTM decoder outputs a sequence of actions to take in the environment while keeping track of the agent’s traversal history. At every time step, the model receives a new visual observation.
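Below is a minimal PyTorch sketch of this kind of LSTM encoder-decoder with attention over the instruction. The layer sizes, the simple dot-product attention, and the class/method names are my own placeholder choices, not the paper's exact implementation details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Seq2SeqAgent(nn.Module):
    """Sketch of an instruction-conditioned navigation policy: an LSTM encodes
    the instruction, and an LSTM decoder attends over the encoded words at
    each step to predict the next low-level action."""
    def __init__(self, vocab_size, n_actions=6, d_word=256, d_hidden=512, d_img=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_word)
        self.encoder = nn.LSTM(d_word, d_hidden, batch_first=True)
        # Decoder input = image feature (e.g. ResNet-152) + previous action embedding.
        self.action_embed = nn.Embedding(n_actions, d_word)
        self.decoder = nn.LSTMCell(d_img + d_word, d_hidden)
        self.action_head = nn.Linear(d_hidden * 2, n_actions)

    def encode(self, instr_tokens):
        ctx, (h, c) = self.encoder(self.embed(instr_tokens))   # ctx: (B, L, H)
        return ctx, (h.squeeze(0), c.squeeze(0))

    def step(self, img_feat, prev_action, h, c, ctx):
        """One decoding step: fuse the current visual observation with the
        previous action, then attend over the instruction context."""
        x = torch.cat([img_feat, self.action_embed(prev_action)], dim=-1)
        h, c = self.decoder(x, (h, c))
        attn = F.softmax(torch.bmm(ctx, h.unsqueeze(2)).squeeze(2), dim=1)   # (B, L)
        weighted_ctx = torch.bmm(attn.unsqueeze(1), ctx).squeeze(1)          # (B, H)
        logits = self.action_head(torch.cat([h, weighted_ctx], dim=-1))
        return logits, h, c
```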

During training, the model is trained to predict the next action on the shortest path from the current state to the goal. The authors also compare two training regimes, sketched below: “teacher-forcing”, where the ground-truth (shortest-path) action is passed as the next input to the decoder, and “student-forcing”, where the next action is sampled from the model’s own predicted distribution.
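A tiny illustration of the difference between the two regimes (a hypothetical helper of my own, assuming the decoder produces action logits at each step):

```python
import torch

def next_decoder_input(action_logits, gt_action, mode):
    """Pick the action fed back into the decoder at the next time step.

    'teacher': feed the ground-truth shortest-path action;
    'student': sample from the model's own predicted distribution, so the
    agent also sees (and learns to recover from) off-path states.
    """
    if mode == 'teacher':
        return gt_action
    probs = torch.softmax(action_logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)
```

In both regimes the cross-entropy loss is still computed against the shortest-path action; they differ only in what the decoder conditions on at the next step. If I remember the results correctly, student-forcing generalized better, especially in unseen environments.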

What do I think?

Overall, I think the authors did a great job presenting the VLN task and laying the groundwork for future research, including a novel 3D simulator, a benchmark dataset, and baseline models.

One limitation the paper itself mentions stems from its choice of the Matterport3D dataset, which comprises clean, tidy scenes of luxurious interiors with hardly any moving objects, such as humans or animals.

The simulator could also be extended to incorporate depth information so that the agent can learn a semantic depth map of the environment. Even so, the choice is commendable: real-world imagery with rich visual context is important for preventing overfitting. Another implicit limitation is that the language side currently supports only English instructions, which is inconvenient for non-English speakers. I expect future work incorporating more powerful language models into the VLN task to expand on this.

Future Work

I think future work could extend the model's action space from the six low-level actions (turning left/right and looking up/down in 30-degree increments, moving forward, and stopping) to a panoramic action space over full 360-degree viewpoints, as sketched below, giving a better parameterization and a more complete view. Overall, the combination of vision and language with interaction in a dynamic environment is highly practical and, from my point of view, has an exciting outlook.
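To illustrate the idea, here is a hypothetical sketch of such a panoramic parameterization; it follows the direction later panoramic-action-space agents took, not anything implemented in this paper, and all names and dimensions below are my own assumptions.

```python
import torch
import torch.nn as nn

class PanoramicActionHead(nn.Module):
    """Hypothetical sketch: instead of choosing among six low-level motions,
    score every candidate navigable viewpoint around the agent (the 360-degree
    panorama discretized into per-view features) against the decoder state."""

    def __init__(self, d_hidden=512, d_view=2048 + 128):
        super().__init__()
        # d_view = image feature + an encoding of each view's heading/elevation.
        self.proj = nn.Linear(d_view, d_hidden)

    def forward(self, h, cand_feats):
        # h: (B, d_hidden) decoder state
        # cand_feats: (B, K, d_view) features of K candidate viewpoints
        scores = torch.bmm(self.proj(cand_feats), h.unsqueeze(2)).squeeze(2)  # (B, K)
        return scores   # softmax/argmax over K picks the next viewpoint to move to
```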

From my reading, this paper is the foundational work for VLN and has inspired a stream of follow-up research on this task, aiming to better handle the ambiguity of language instructions and the partial observability of the agent. This includes combining imitation learning and reinforcement learning [1] and exploring auxiliary tasks as self-supervised signals with self-monitoring agents [2].

References

[1] Wang, X., Huang, Q., Celikyilmaz, A., Gao, J., Shen, D., Wang, Y. F., … & Zhang, L. (2019). Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6629-6638).

[2] Ma, C. Y., Lu, J., Wu, Z., AlRegib, G., Kira, Z., Socher, R., & Xiong, C. (2019). Self-monitoring navigation agent via auxiliary progress estimation. arXiv preprint arXiv:1901.03035.